========================================================
This project is an exploratory data analysis of campaign contributions for the 2012 presidential election in the state of Ohio. The state of Ohio is historically a critical swing state for presidential elections and, according to wikipedia, has the longest streak of matching the overall election outcome (since 1960). It would be a mistake to assume that campaign contributions are a predictor or even loosely correlated with votes (I might be biting my tongue if we were talking about superPACs though), but Ohio is one of the most frequently visited states on a presidential nominee’s campaign trail and for this reason it piqued my interest. I’ll explore the nature of campaign contributions and see if there are any interesting relationships in the data. I’ll also take a look at the geographic distribution of campaign contributions by zipcode. This exploration may shed light on the characteristics of a generous contributor and perhaps why particular party candidates visit certain areas in the state. Anything of significance could be used in future inferential or predictive analysis.
There are 3 data sets here:
We’ll link these datasets by zipcode much like tables in a database. The geographic and demographic data come from a library and are already nicely formatted. The features in the campaign contribution data, however, might need some explicit class declarations
Let’s peak at the first few rows to inspect the atomic classes (aka types).
df_look <- read.csv("campaign-contributions-ohio-2012.csv", nrows = 100)## Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
While trying to read the file, we get some kind of error about the row names. Let’s explicitly set row.names = NULL to see if that fixes it.
df_look <- read.csv("campaign-contributions-ohio-2012.csv", nrows = 100,
row.names=NULL)## row.names cmte_id cand_id cand_nm contbr_nm
## 1 C00495820 P80000748 Paul, Ron BROWN, TODD W MR. BATAVIA
## 2 C00495820 P80000748 Paul, Ron DIEHL, MARGO SONJA CANTON
## 3 C00495820 P80000748 Paul, Ron KIRCHMEYER, BENJAMIN NORTH CANTON
## 4 C00431445 P80003338 Obama, Barack KEYES, STEPHEN BEXLEY
## 5 C00431445 P80003338 Obama, Barack MURPHY, MIKE W COLUMBUS
## 6 C00431445 P80003338 Obama, Barack ASHMAN, LEWIS J. DAYTON
## contbr_city contbr_st contbr_zip
## 1 OH 451034017 GENERAL ELECTRIC
## 2 OH 44718 NONE
## 3 OH 44720 DIEBOLD, INC
## 4 OH 432091491 NATIONWIDE MUTUAL INSURANCE CO.
## 5 OH 43214 ROCK TENN COMPANY
## 6 OH 454091226 KETTERING SCHOOL SYSTEM
## contbr_employer contbr_occupation contb_receipt_amt
## 1 ENGINEER 50.0 06-DEC-11
## 2 RETIRED 25.0 06-DEC-11
## 3 COMPUTER PROGRAMMER 201.2 06-DEC-11
## 4 HR EXECUTIVE / ATTORNEY 100.0 30-SEP-11
## 5 MANAGER 50.0 26-SEP-11
## 6 TEACHER 50.0 27-SEP-11
## contb_receipt_dt receipt_desc memo_cd memo_text form_tp file_num
## 1 NA NA NA SA17A 779227 0925592
## 2 NA NA NA SA17A 779227 0925663
## 3 NA NA NA SA17A 779227 0925696
## 4 NA NA NA SA17A 756218 C12432798
## 5 NA NA NA SA17A 756218 C12223508
## 6 NA NA NA SA17A 756218 C12234248
## tran_id election_tp
## 1 P2012 NA
## 2 P2012 NA
## 3 P2012 NA
## 4 P2012 NA
## 5 P2012 NA
## 6 P2012 NA
By setting the row names to NULL we forced row numbering but for some reason it used an existing column as the row number. Using head to look at the first 6 lines, it seems like there is a mismatch between the header and the rest of the data. Let’s manually open a connection to the file and read a few lines including the header to see if we can pinpoint the problem.
con <- file("campaign-contributions-ohio-2012.csv", "r")
lines <- readLines(con, 5); close(con); lines## [1] "cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp"
## [2] "C00495820,\"P80000748\",\"Paul, Ron\",\"BROWN, TODD W MR.\",\"BATAVIA\",\"OH\",\"451034017\",\"GENERAL ELECTRIC\",\"ENGINEER\",50,06-DEC-11,\"\",\"\",\"\",\"SA17A\",\"779227\",\"0925592\",\"P2012\","
## [3] "C00495820,\"P80000748\",\"Paul, Ron\",\"DIEHL, MARGO SONJA\",\"CANTON\",\"OH\",\"44718\",\"NONE\",\"RETIRED\",25,06-DEC-11,\"\",\"\",\"\",\"SA17A\",\"779227\",\"0925663\",\"P2012\","
## [4] "C00495820,\"P80000748\",\"Paul, Ron\",\"KIRCHMEYER, BENJAMIN\",\"NORTH CANTON\",\"OH\",\"44720\",\"DIEBOLD, INC\",\"COMPUTER PROGRAMMER\",201.2,06-DEC-11,\"\",\"\",\"\",\"SA17A\",\"779227\",\"0925696\",\"P2012\","
## [5] "C00431445,\"P80003338\",\"Obama, Barack\",\"KEYES, STEPHEN\",\"BEXLEY\",\"OH\",\"432091491\",\"NATIONWIDE MUTUAL INSURANCE CO.\",\"HR EXECUTIVE / ATTORNEY\",100,30-SEP-11,\"\",\"\",\"\",\"SA17A\",\"756218\",\"C12432798\",\"P2012\","
It’s difficult to tell from this output, but it looks like we have an extra empty column at the end of each row. We could use count.fields() but that opens up the entire file. Instead, let’s create a smaller test file to do this.
con <- file("campaign-contributions-ohio-2012.csv", "r")
lines <- readLines(con, 20); close(con)
# using write.csv() here seems to cause problems with escape characters
# using a connection object works better
con <- file("test.csv"); writeLines(lines, con); close(con)
count.fields("test.csv", sep = ",")## [1] 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
This confirms that the data following the header somehow has an extra field. Looking back at the results from head and readlines, we can see that the extra field is full of NAs and it’s just an empty field caused by "," at the end of each row.
We can fix this a few different ways. We can set the row names to NULL again, shift the column names to the left, and then remove the last column from the dataframe. Or, we can get the header row by itself and then use it as the column name for the rest of the data without that last empty column. The second options seems easier so we’ll go with that. After, we can finally peak into the correctly structured data to see if we need to make any adjustments (like strings as factors).
## 'data.frame': 19 obs. of 18 variables:
## $ cmte_id : Factor w/ 2 levels "C00431445","C00495820": 2 2 2 1 1 1 1 1 1 2 ...
## $ cand_id : Factor w/ 2 levels "P80000748","P80003338": 1 1 1 2 2 2 2 2 2 1 ...
## $ cand_nm : Factor w/ 2 levels "Obama, Barack",..: 2 2 2 1 1 1 1 1 1 2 ...
## $ contbr_nm : Factor w/ 19 levels "ASHMAN, LEWIS J.",..: 4 5 11 10 15 1 12 3 9 14 ...
## $ contbr_city : Factor w/ 17 levels "BATAVIA","BEXLEY",..: 1 3 10 2 6 7 5 4 14 8 ...
## $ contbr_st : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
## $ contbr_zip : int 451034017 44718 44720 432091491 43214 454091226 441201574 452243150 45690 440775592 ...
## $ contbr_employer : Factor w/ 17 levels "CH2M HILL","DIEBOLD, INC",..: 6 10 2 9 13 7 12 12 11 4 ...
## $ contbr_occupation: Factor w/ 15 levels "BUSINESS OWNER",..: 3 11 2 4 6 15 10 11 14 12 ...
## $ contb_receipt_amt: num 50 25 201 100 50 ...
## $ contb_receipt_dt : Factor w/ 10 levels "04-AUG-11","06-DEC-11",..: 2 2 2 9 6 7 9 6 1 2 ...
## $ receipt_desc : logi NA NA NA NA NA NA ...
## $ memo_cd : logi NA NA NA NA NA NA ...
## $ memo_text : logi NA NA NA NA NA NA ...
## $ form_tp : Factor w/ 1 level "SA17A": 1 1 1 1 1 1 1 1 1 1 ...
## $ file_num : int 779227 779227 779227 756218 756218 756218 756218 756218 756218 779227 ...
## $ tran_id : Factor w/ 19 levels "0922774","0922801",..: 6 7 8 18 14 16 19 15 10 1 ...
## $ election_tp : Factor w/ 1 level "P2012": 1 1 1 1 1 1 1 1 1 1 ...
There are more features that we want as factors than characters so we’ll keep that default setting. Contributor name should be a character, and zipcode should also be a character since it looks like some zipcodes are the full 9 digits and thus we’re gonna have to do some string manipulation. The receipt date should be a date object but we can also set that manually after we read in the data.
Now we can read in the entire contributions file. We’ll have to skip the header again but since we already have the column names from the test dataframe we can set them easily. We add a dummy name to the column names variable to account for the empty entry and then delete that column afterwards.
column_names <- c(column_names, "dummy")
df <- read.csv("campaign-contributions-ohio-2012.csv",
header = FALSE, col.names = column_names, skip = 1)
df$dummy <- NULL; head(df)## cmte_id cand_id cand_nm contbr_nm contbr_city
## 1 C00495820 P80000748 Paul, Ron BROWN, TODD W MR. BATAVIA
## 2 C00495820 P80000748 Paul, Ron DIEHL, MARGO SONJA CANTON
## 3 C00495820 P80000748 Paul, Ron KIRCHMEYER, BENJAMIN NORTH CANTON
## 4 C00431445 P80003338 Obama, Barack KEYES, STEPHEN BEXLEY
## 5 C00431445 P80003338 Obama, Barack MURPHY, MIKE W COLUMBUS
## 6 C00431445 P80003338 Obama, Barack ASHMAN, LEWIS J. DAYTON
## contbr_st contbr_zip contbr_employer
## 1 OH 451034017 GENERAL ELECTRIC
## 2 OH 44718 NONE
## 3 OH 44720 DIEBOLD, INC
## 4 OH 432091491 NATIONWIDE MUTUAL INSURANCE CO.
## 5 OH 43214 ROCK TENN COMPANY
## 6 OH 454091226 KETTERING SCHOOL SYSTEM
## contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc
## 1 ENGINEER 50.0 06-DEC-11
## 2 RETIRED 25.0 06-DEC-11
## 3 COMPUTER PROGRAMMER 201.2 06-DEC-11
## 4 HR EXECUTIVE / ATTORNEY 100.0 30-SEP-11
## 5 MANAGER 50.0 26-SEP-11
## 6 TEACHER 50.0 27-SEP-11
## memo_cd memo_text form_tp file_num tran_id election_tp
## 1 SA17A 779227 0925592 P2012
## 2 SA17A 779227 0925663 P2012
## 3 SA17A 779227 0925696 P2012
## 4 SA17A 756218 C12432798 P2012
## 5 SA17A 756218 C12223508 P2012
## 6 SA17A 756218 C12234248 P2012
I’m curious to see how the readr package might have handled this, as it’s supposed to be more intuitive with reading flat files. The only drawback is that we would have to set the factors manually which could be pretty lengthy. In any case, let’s check it out.
df_readr <- read_csv("campaign-contributions-ohio-2012.csv",
n_max = 100)## Warning: 101 parsing failures.
## row col expected actual
## 1 -- 18 columns 19 columns
## 2 -- 18 columns 19 columns
## 3 -- 18 columns 19 columns
## 4 -- 18 columns 19 columns
## 5 -- 18 columns 19 columns
## ... ... .......... ..........
## .See problems(...) for more details.
## Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of 18 variables:
## $ cmte_id : chr "C00495820" "C00495820" "C00495820" "C00431445" ...
## $ cand_id : chr "P80000748" "P80000748" "P80000748" "P80003338" ...
## $ cand_nm : chr "Paul, Ron" "Paul, Ron" "Paul, Ron" "Obama, Barack" ...
## $ contbr_nm : chr "BROWN, TODD W MR." "DIEHL, MARGO SONJA" "KIRCHMEYER, BENJAMIN" "KEYES, STEPHEN" ...
## $ contbr_city : chr "BATAVIA" "CANTON" "NORTH CANTON" "BEXLEY" ...
## $ contbr_st : chr "OH" "OH" "OH" "OH" ...
## $ contbr_zip : int 451034017 44718 44720 432091491 43214 454091226 441201574 452243150 45690 440775592 ...
## $ contbr_employer : chr "GENERAL ELECTRIC" "NONE" "DIEBOLD, INC" "NATIONWIDE MUTUAL INSURANCE CO." ...
## $ contbr_occupation: chr "ENGINEER" "RETIRED" "COMPUTER PROGRAMMER" "HR EXECUTIVE / ATTORNEY" ...
## $ contb_receipt_amt: num 50 25 201 100 50 ...
## $ contb_receipt_dt : chr "06-DEC-11" "06-DEC-11" "06-DEC-11" "30-SEP-11" ...
## $ receipt_desc : chr "" "" "" "" ...
## $ memo_cd : chr "" "" "" "" ...
## $ memo_text : chr "" "" "" "" ...
## $ form_tp : chr "SA17A" "SA17A" "SA17A" "SA17A" ...
## $ file_num : int 779227 779227 779227 756218 756218 756218 756218 756218 756218 779227 ...
## $ tran_id : chr "0925592" "0925663" "0925696" "C12432798" ...
## $ election_tp : chr "P2012" "P2012" "P2012" "P2012" ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 101 obs. of 4 variables:
## ..$ row : int 1 2 3 4 5 6 7 8 9 10 ...
## ..$ col : chr NA NA NA NA ...
## ..$ expected: chr "18 columns" "18 columns" "18 columns" "18 columns" ...
## ..$ actual : chr "19 columns" "19 columns" "19 columns" "19 columns" ...
We got a warning about the column mismatch but readr handled the discrepancy. We could have done this instead and set the factors one by one but read.csv allows us to set all characters as factors which was advantageous. It took way more effort than readr would have, but it also allowed us to pinpoint the exact problem.
Back to loading the rest of the data. Let’s load the geographic and demographic data.
## 'data.frame': 5403044 obs. of 13 variables:
## $ long : num -88.4 -88.4 -88.4 -88.4 -88.4 ...
## $ lat : num 31.1 31.1 31.1 31.1 31.1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hole : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ piece : Factor w/ 21 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group : Factor w/ 42275 levels "0.1","1.1","1.2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ id : chr "0" "0" "0" "0" ...
## $ ZCTA5CE10 : Factor w/ 32989 levels "01001","01002",..: 11619 11619 11619 11619 11619 11619 11619 11619 11619 11619 ...
## $ AFFGEOID10: Factor w/ 32989 levels "8600000US01001",..: 11619 11619 11619 11619 11619 11619 11619 11619 11619 11619 ...
## $ GEOID10 : Factor w/ 32989 levels "01001","01002",..: 11619 11619 11619 11619 11619 11619 11619 11619 11619 11619 ...
## $ ALAND10 : num 5.7e+08 5.7e+08 5.7e+08 5.7e+08 5.7e+08 ...
## $ AWATER10 : num 3241962 3241962 3241962 3241962 3241962 ...
## $ region : chr "36522" "36522" "36522" "36522" ...
## 'data.frame': 33120 obs. of 9 variables:
## $ region : chr "00601" "00602" "00603" "00606" ...
## $ total_population : num 18450 41302 53683 6591 28963 ...
## $ percent_white : num 1 4 2 0 1 0 0 1 2 0 ...
## $ percent_black : num 0 0 0 0 0 0 0 0 0 0 ...
## $ percent_asian : num 0 0 0 0 0 0 0 0 0 0 ...
## $ percent_hispanic : num 99 94 96 100 99 100 100 99 98 100 ...
## $ per_capita_income: num 7380 8463 9176 6383 7892 ...
## $ median_rent : num 285 319 252 230 334 315 285 338 400 319 ...
## $ median_age : num 36.6 38.6 38.9 37.3 39.2 38.5 40.9 36.2 42 39.7 ...
For both of these dataframes, the region variable (zipcode) is what we’ll be using to relate to the contributions data. For the demographic data, we can add the fields we want to the contributions dataframe. However, we’ll be making choropleth maps with the geographic data from different aggregate statistics from the contributions dataframe. So we’ll have to concatenate these 2 on the fly depending upon what we want to show.
Before adding the demographic data, let’s get the contributions dataframe prepared. Let’s get rid of the data we don’t want and rename the columns. There are quite a few features in the contribution data that don’t really help our exploration like the ID number of the candidate or the file number of the contribution. The only features of interest are candidate, name (of the contributor), city, state, zipcode, employer, occupation, amount, and date.
df <- df[, 3:11] # the features we want are consecutive
# rename the features to make them more readable
names(df) <- c("candidate", "name", "city", "state", "zipcode",
"employer", "occupation", "amount", "date")
summary(df)## candidate name city
## Obama, Barack :91286 WADE, ANNIE SANDS : 114 CINCINNATI: 18596
## Romney, Mitt :50672 DUPHIL, MONIQUE : 99 COLUMBUS : 13820
## Paul, Ron : 4271 RINGO, RICHARD A. MR.: 97 DAYTON : 5536
## Santorum, Rick: 2012 POWERS, KAREN : 90 CLEVELAND : 4748
## Gingrich, Newt: 1432 KNEELAND, HAROLD : 88 TOLEDO : 3276
## Cain, Herman : 583 RUPPER, DARVIS : 86 AKRON : 2965
## (Other) : 1223 (Other) :150905 (Other) :102538
## state zipcode
## OH:151479 43214 : 655
## 44122 : 586
## 44118 : 584
## 43202 : 511
## 44107 : 507
## 45220 : 476
## (Other):148160
## employer
## RETIRED :34269
## SELF-EMPLOYED :11503
## NOT EMPLOYED : 8657
## INFORMATION REQUESTED PER BEST EFFORTS: 6049
## INFORMATION REQUESTED : 4217
## (Other) :86741
## NA's : 43
## occupation amount
## RETIRED :38151 Min. :-15000.0
## INFORMATION REQUESTED PER BEST EFFORTS: 5778 1st Qu.: 25.0
## HOMEMAKER : 4674 Median : 50.0
## PHYSICIAN : 4458 Mean : 202.5
## ATTORNEY : 4056 3rd Qu.: 150.0
## (Other) :94351 Max. : 10000.0
## NA's : 11
## date
## 17-OCT-12: 4176
## 02-NOV-12: 3465
## 23-OCT-12: 3066
## 22-OCT-12: 2569
## 31-OCT-12: 2547
## 31-AUG-12: 2534
## (Other) :133122
There’s something interesting here that we can see by leaving the contributor names as factors. There are a few contributors who made nearly 100 contributions. We can also see which zipcodes had the most contributions. This reinforces my thoughts about splitting the data up into 2 different sets, one for individual contributions and another for aggregate contributions per person. If in the end our aim is to build a model for predicting contributions it might be more practical to predict total contributions rather than individual amounts. We can create an aggregate contributions dataset at the end of our data munging process when we have all the other features we want. Then we can compare feature correlations between datasets to see if either one looks more promising for model building.
Let’s coerce those features that we mentioned before to characters and make date into a date object.
to_character <- c("name", "zipcode", "date")
for(col in to_character) {
df[, col] <- as.character(df[, col])
}
df$date <- as.Date(df$date, format = "%d-%b-%y"); sapply(df, class)## candidate name city state zipcode employer
## "factor" "character" "factor" "factor" "character" "factor"
## occupation amount date
## "factor" "numeric" "Date"
Now let’s clean the zipcodes so that they all have just 5 digits, and so that we can relate them to the zipcodes in the geographic and demographic data. We also have to make sure that they are in Ohio.
df$zipcode <- substring(df$zipcode, 1, 5)
# if starts with '45' or '44' or '43' then in Ohio
zip_legit <- as.character(seq(43000, 46000, by = 1))
# convert non-legit zipcodes to 'NA' character
df$zipcode <- ifelse(df$zipcode %in% zip_legit, df$zipcode, 'NA')
summary(as.numeric(df$zipcode))## Warning in summary(as.numeric(df$zipcode)): NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 43000 43550 44150 44320 45210 45900 219
Looks like they are all within the correct range with the exception of the NAs. This data won’t be usable for the choropleth maps but it’s still useful for other statistics so we’ll keep it.
We do have negative amounts though, which happen to be refunds, so we’ll have to exclude those observations.
bad <- df$amount <= 0; sum(bad); df <- df[!bad, ]## [1] 1272
There were 1272 refunds excluded.
Let’s engineer some features in order to make more use of the data that we have. We’ll get gender from name and party status from candidate.
# make first name feature from full name
df$first_name <- substring(str_extract(df$name, ',\\s[A-Z]+'), 3, )
# get dataframe of unique first names and gender
gender_names_df <- gender(unique(df$first_name))
gender_names_df <- gender_names_df[, c("name", "gender")]
# merge dataframe with gender names
df <- merge(df, gender_names_df,
by.x = 'first_name', by.y = 'name', all.x = TRUE)
# get rid of first name and gender names df, make gender a factor
df$first_name <- NULL; rm(gender_names_df); df$gender <- as.factor(df$gender)
# NA gender count
print("missing gender:"); sum(is.na(df$gender))## [1] "missing gender:"
## [1] 4677
candidates <- unique(df$candidate)
democrat <- "Obama, Barack"; green <- "Stein, Jill"
republican <- candidates[!(candidates %in% c(democrat, green))]
df$party <- ifelse(df$candidate %in% republican, "republican",
ifelse(df$candidate == democrat, "democrat", "green"))
df$party <- as.factor(df$party)Now we can add demographic data to the dataset and get coordinates for the cities from the web. The cities will help us to see things better on the map.
demographics.ohio <- df_zip_demographics %>% filter(region %in% zip_legit)
df <- merge(df, demographics.ohio,
by.x = 'zipcode', by.y = 'region', all.x = TRUE)
# shorten the names
colnames(df)[12:19] <- c("population", "pcnt_wht", "pcnt_blk", "pcnt_asn",
"pcnt_hsp", "percap_incm", "med_rent", "med_age")
rm(df_zip_demographics)
# get city coordinates
webpage_ohio_cities <-
read_html("http://www.geonames.org/US/OH/largest-cities-in-ohio.html")
city_names <- webpage_ohio_cities %>%
xml_find_all("//tr/td/a[contains(@href, 'geonames')]/text()") %>%
as.character()
city_pop <- webpage_ohio_cities %>%
xml_find_all("//tr/td[contains(@class, 'rightalign')]/text()") %>%
as.character() %>%
sub(",", "", .) %>%
as.numeric()
city_coord <- webpage_ohio_cities %>%
xml_find_all('//tr/td/a[contains(@href, "maps")]/text()') %>%
as.character()
city_lat <- as.numeric(str_extract(city_coord, '^[0-9.]+'))
city_long <- as.numeric(str_extract(city_coord, '[-0-9.]+$'))
mapdata.ohio.cities <- data.frame(city_names, city_long, city_lat, city_pop)Finally, we subset the map data for Ohio only since it includes the entire United States.
mapdata.ohio <- zip.map %>% filter(region %in% zip_legit) %>% arrange(order)
rm(zip.map)
## The correlations for the aggregate dataframe were only slightly better
## and there were lots of data inconsistencies to make use of occupation
## so this was not included in further analysis.
# Finally, we can create the aggregate contributions dataframe.
# I'll consider anybody with the same name, of the same gender, in the same zipcode, who made
# contributions to the same party, the same person
# (occupation and employer proved to be inconsistent for the same person across different contributions).
if(FALSE) {
df_agg <- df
# strip name of non-alphanumeric chars and whitespace
df_agg$name <- df_agg$name %>% str_replace_all("[^[:alnum:]]", " ") %>% str_trim(side = "both")
# aggregate and set any inconsistencies in occupation to the string with lesser value (better hack?)
df_agg <- df_agg %>%
group_by(name, gender, zipcode, party, percap_incm, med_age) %>%
summarise(count = n(), total_amt = sum(amount), occupation = min(as.character(occupation)))
df_agg$occupation <- as.factor(df_agg$occupation)
df_agg <- as.data.frame(df_agg)
# summary and correlations
summary(df_agg); str(df_agg)
library(polycor); hetcor(df_agg[, c(2, 4:8)])
}
# save/load data into .Rdata file for easy start
#save(df, demographics.ohio, mapdata.ohio, mapdata.ohio.cities, file = "processed-data.Rdata")
#load("processed-data.Rdata")knitr::opts_chunk$set(echo=FALSE, warning=FALSE, messages=FALSE)Now that the data is in the right format let’s start our univariate exploration.
## [1] "zipcode" "candidate" "name" "city" "state"
## [6] "employer" "occupation" "amount" "date" "gender"
## [11] "party" "population" "pcnt_wht" "pcnt_blk" "pcnt_asn"
## [16] "pcnt_hsp" "percap_incm" "med_rent" "med_age"
We have 19 different variables,
## 'data.frame': 150207 obs. of 19 variables:
## $ zipcode : chr "43001" "43001" "43001" "43001" ...
## $ candidate : Factor w/ 14 levels "Bachmann, Michele",..: 12 7 12 12 12 12 12 12 12 12 ...
## $ name : chr "CHAULK, SARAH" "BAKER, NANCY" "TODD, DARRELL M. MR." "CHAULK, JOSEPH MR." ...
## $ city : Factor w/ 1167 levels "ABERDEEN","ADA",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ state : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
## $ employer : Factor w/ 13507 levels "","1099","121 ARW",..: 10417 994 5647 1634 5647 9804 2692 1634 1634 1634 ...
## $ occupation : Factor w/ 6846 levels "","-","100% DISABLED VIETNAM VETERAN",..: 2594 4148 2960 882 2960 5231 1282 882 882 882 ...
## $ amount : num 546 35 100 92.2 125 ...
## $ date : Date, format: "2012-10-27" "2012-06-25" ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 2 ...
## $ party : Factor w/ 3 levels "democrat","green",..: 3 1 3 3 3 3 3 3 3 3 ...
## $ population : num 2295 2295 2295 2295 2295 ...
## $ pcnt_wht : num 93 93 93 93 93 93 93 93 93 93 ...
## $ pcnt_blk : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pcnt_asn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pcnt_hsp : num 3 3 3 3 3 3 3 3 3 3 ...
## $ percap_incm: num 34306 34306 34306 34306 34306 ...
## $ med_rent : num 592 592 592 592 592 592 592 592 592 592 ...
## $ med_age : num 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 ...
with varying classes.
## zipcode candidate name
## Length:150207 Obama, Barack :90588 Length:150207
## Class :character Romney, Mitt :50198 Class :character
## Mode :character Paul, Ron : 4246 Mode :character
## Santorum, Rick: 2001
## Gingrich, Newt: 1391
## Cain, Herman : 581
## (Other) : 1202
## city state
## CINCINNATI: 18446 OH:150207
## COLUMBUS : 13703
## DAYTON : 5488
## CLEVELAND : 4697
## TOLEDO : 3261
## AKRON : 2940
## (Other) :101672
## employer
## RETIRED :34158
## SELF-EMPLOYED :11477
## NOT EMPLOYED : 8648
## INFORMATION REQUESTED PER BEST EFFORTS: 6030
## INFORMATION REQUESTED : 4216
## (Other) :85635
## NA's : 43
## occupation amount
## RETIRED :38029 Min. : 1.0
## INFORMATION REQUESTED PER BEST EFFORTS: 5759 1st Qu.: 25.0
## HOMEMAKER : 4623 Median : 50.0
## PHYSICIAN : 4447 Mean : 210.8
## ATTORNEY : 4033 3rd Qu.: 150.0
## (Other) :93305 Max. :10000.0
## NA's : 11
## date gender party population
## Min. :2011-01-28 female:66613 democrat :90588 Min. : 0
## 1st Qu.:2012-07-05 male :78917 green : 19 1st Qu.:16076
## Median :2012-09-17 NA's : 4677 republican:59600 Median :25049
## Mean :2012-08-05 Mean :26621
## 3rd Qu.:2012-10-17 3rd Qu.:35078
## Max. :2012-12-27 Max. :68475
## NA's :973
## pcnt_wht pcnt_blk pcnt_asn pcnt_hsp
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 76.00 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 87.00 Median : 4.00 Median : 2.000 Median : 2.000
## Mean : 79.75 Mean :12.22 Mean : 2.936 Mean : 2.842
## 3rd Qu.: 93.00 3rd Qu.:13.00 3rd Qu.: 4.000 3rd Qu.: 3.000
## Max. :100.00 Max. :94.00 Max. :22.000 Max. :61.000
## NA's :987 NA's :987 NA's :987 NA's :987
## percap_incm med_rent med_age
## Min. : 864 Min. : 213.0 Min. : 6.80
## 1st Qu.:23951 1st Qu.: 550.0 1st Qu.:36.30
## Median :30291 Median : 643.0 Median :39.50
## Mean :32167 Mean : 668.4 Mean :39.25
## 3rd Qu.:38759 3rd Qu.: 749.0 3rd Qu.:43.10
## Max. :67742 Max. :1475.0 Max. :83.50
## NA's :1023 NA's :1735 NA's :992
The most popular candidate to receive contributions was Barack Obama and the most popular party was the democratic party. The city with the most contributions was Cincinnati and retired people represented the occupation with the largest amount of contributions. The demographic of males made more contributions than females. The median contribution was $50 and the average is $211.
We have long-tailed data for the amount of contributions. There are so many contributions made under $1,000 that it’s hard to see any of the outliers. A logarithmic transformation of the x-axis reveals something of a log-normal distribution for the amount.
There are a significant group of people (4195) that, despite the long-tailed distribution, contribute $2,500. It may be that there is a tax write-off reason for this. Looking closer we can see that there appear to be several discrete values in increments of $50 that people are accustomed to contributing. Under $60, we can see contributions spaced in intervals of $5.
An overwhelming majority of contributions (93.5%) were made in 2012, and we can see a steady increase in contributions leading up to the election in November. There seems to be a slight increase in the amount of contributions made toward the end of the month. There is also a peak at about halfway through the month. People might be making contributions immediately after receiving their paycheck or pension.
Although a significant difference between the amount of male and female contributions, the proportion (0.54 in favor of males) of the gap is not very large.
Most contributions are made to either Obama or Romney. Using a log scale we can see the other candidates a little better. Cumulatively, there are a significant amount of contributions made to other candidates but they are mostly Republican. The proportion of contributions to democrats vs. republicans seems to resemble the previous histrogram between Obama and Romney. The amount of contributions to the green party is so small (19 contributions) that we might want to exclude for simplicity.
It’s important to remember that the remaining demographic variables correspond to the contributor’s zipcode and not to the contributor him/herself.
The distribution of population in which contributors live is fairly normal with mean 2.662110^{4} and median 2.504910^{4}.
Most contributors live in areas with a high percentage of white ethnicity and low percentages of black, asian, or hispanic ethnicities.
The distribution of per capita income in which contributors live appears poisson, with the bulk of contributors living in zipcodes with a range of per capita income of about $20,000 - $40,000. The average is $3.216710^{4} and the median is $3.029110^{4}.
We see a fairly normal distribution with median rent in the locations in which contributors live. Rent is very cheap (median of $643) as compared to California but per capita income is also lower.
The median age in which contributors live also resembles a normal distrubtion with median 39.5 Looking back at the summary statistics we can see that most contributions come from areas where the median age is over 40 (39.5 precisely). I could not find exact statistics for the median age in Ohio in 2012, but in 2015 it appears to be 38.8. We might be tempted to say that contributions come from areas where the population is older which would make sense given the high number of retiree contributions. However whether or not the difference is significant would need some inferential analysis which is beyond the scope of this project.
A choropleth map of total contributions by zipcode shows that contributions generally come from areas nearer to cities. There are also quite a few zipcodes without any contributions (seen with the same background color).
There are 151,479 instances of campaign contribution in the dataset with 19 features. From the original data set 11 features were kept or derived:
Using the zipcode feature, demographic information was added from another data set:
From the original data set, all but name, amount, and date are factors. None of the factors are ordered. Name is a character, amount is numeric, and date is a date object. The 8 demographic features are all numeric.
Other observations:
The main features of interest in the data set are amount, gender, party, per capita income, and median age. I would like to see if these factors are correlated with contribution amount. Occupation could be of interest however there are too many levels (6,846).
I believe that the percent ethnicities, total population, and median rent may be correlated with contribution amount.
I created two new variables, one for the gender of the contributor based upon the first name, and another for the party of the contributor based upon the candidate that received the contribution. I was unable to programatically determine gender by first name for about 4,742 instances (approx. 3% of the data).
A log transformation of contribution amount revealed a log-normal distribution. Despite this, we can see in the non-skewed distribution that there are a significant group of people that donate the maximum allowable campaign contribution by law (approx. $2,600). There are also Political Action Committee (PAC) data in the set which have a larger limit (approx. $5,000). I am unsure of the validity of the outliers beyond this amount because of my limited knowledge of campaign finance law. That there were several negative amounts which needed to be corrected to positive leads me to believe that there could be further inaccuracies in the data set.
The data came in a tidy format and did not need to be transformed.
Before I plot any bivariate relationships, I want to see a correlation matrix of my variables of interest. This will help to focus the rest of my exploration and cut down on uninformative plots.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## amount gender party population pcnt_wht
## amount 1 Polyserial Polyserial Pearson Pearson
## gender 0.126 1 Polychoric Polyserial Polyserial
## party 0.4541 0.3664 1 Polyserial Polyserial
## population -0.05573 -0.003391 -0.03011 1 Pearson
## pcnt_wht 0.03888 0.07609 0.3292 -0.06022 1
## pcnt_blk -0.03743 -0.0805 -0.3258 -0.006 -0.9711
## pcnt_asn 0.03653 0.01284 -0.03036 0.1826 -0.09309
## pcnt_hsp -0.03344 0.005547 -0.07443 0.1856 -0.2223
## percap_incm 0.159 0.02015 0.1397 -0.02979 0.2596
## med_rent 0.07272 0.01862 0.06658 0.1581 0.1251
## med_age 0.08183 0.0188 0.2056 -0.1739 0.2695
## time -0.1117 -0.036 -0.09182 0.03104 0.01332
## pcnt_blk pcnt_asn pcnt_hsp percap_incm med_rent
## amount Pearson Pearson Pearson Pearson Pearson
## gender Polyserial Polyserial Polyserial Polyserial Polyserial
## party Polyserial Polyserial Polyserial Polyserial Polyserial
## population Pearson Pearson Pearson Pearson Pearson
## pcnt_wht Pearson Pearson Pearson Pearson Pearson
## pcnt_blk 1 Pearson Pearson Pearson Pearson
## pcnt_asn -0.08454 1 Pearson Pearson Pearson
## pcnt_hsp 0.06317 0.07302 1 Pearson Pearson
## percap_incm -0.3128 0.4696 -0.1276 1 Pearson
## med_rent -0.1987 0.5138 -0.04807 0.7485 1
## med_age -0.1873 -0.2226 -0.2259 0.3313 0.1663
## time -0.01758 0.01289 0.006405 0.002254 0.01291
## med_age time
## amount Pearson Pearson
## gender Polyserial Polyserial
## party Polyserial Polyserial
## population Pearson Pearson
## pcnt_wht Pearson Pearson
## pcnt_blk Pearson Pearson
## pcnt_asn Pearson Pearson
## pcnt_hsp Pearson Pearson
## percap_incm Pearson Pearson
## med_rent Pearson Pearson
## med_age 1 Pearson
## time -0.003134 1
##
## Standard Errors:
## amount gender party population pcnt_wht pcnt_blk
## amount
## gender 0.003466
## party 0.003641 0.003804
## population 0.002628 0.003312 0.003334
## pcnt_wht 0.002632 0.003286 0.003361 0.002627
## pcnt_blk 0.002633 0.003286 0.003523 0.002636 0.0001501
## pcnt_asn 0.002633 0.003318 0.003351 0.002549 0.002614 0.002618
## pcnt_hsp 0.002633 0.003315 0.00348 0.002546 0.002506 0.002626
## percap_incm 0.00257 0.003311 0.003263 0.002634 0.002459 0.002378
## med_rent 0.002622 0.003316 0.003304 0.00257 0.002595 0.002532
## med_age 0.002619 0.003309 0.003278 0.002557 0.002445 0.002544
## time 0.002604 0.003321 0.003276 0.002634 0.002636 0.002636
## pcnt_asn pcnt_hsp percap_incm med_rent med_age
## amount
## gender
## party
## population
## pcnt_wht
## pcnt_blk
## pcnt_asn
## pcnt_hsp 0.002622
## percap_incm 0.002055 0.002593
## med_rent 0.00194 0.00263 0.001159
## med_age 0.002506 0.002502 0.002347 0.002564
## time 0.002636 0.002636 0.002636 0.002636 0.002636
##
## n = 143873
##
## P-values for Tests of Bivariate Normality:
## amount gender party population pcnt_wht pcnt_blk pcnt_asn
## amount
## gender 0
## party 0 0.3947
## population 0 0 0
## pcnt_wht 0 0 0 0
## pcnt_blk 0 0 0 0 0
## pcnt_asn 0 0 0 0 0 0
## pcnt_hsp 0 0 0 0 0 0 0
## percap_incm 0 0 0 0 0 0 0
## med_rent 0 0 0 0 0 0 0
## med_age 0 0 0 0 0 0 0
## time 0 0 0 0 0 0 0
## pcnt_hsp percap_incm med_rent med_age
## amount
## gender
## party
## population
## pcnt_wht
## pcnt_blk
## pcnt_asn
## pcnt_hsp
## percap_incm 0
## med_rent 0 0
## med_age 0 0 0
## time 0 0 0 0
Party, percapita income of the contributor’s zipcode, gender, and time are the only variables with a correlation coefficient above 0.10. Time has a negative correlation with amount indicating that despite the increase in contributions leading up to the election, amount of a contribution seems to wane as the election approaches. I’ve never been to Ohio but because of the strong negative correlation between percentage of white and black ethnicity for a contributor’s zipcode, I would think that it is somewhat segregated.
##
## One-way analysis of means
##
## data: df$amount and as.factor(year(df$date))
## F = 522.96, num df = 1, denom df = 150200, p-value < 2.2e-16
Contributions really ramp up in the summer before the election and the average contribution amount is significantly higher in 2012 than in 2011.
##
## One-way analysis of means
##
## data: df$amount and df$gender
## F = 1310.8, num df = 1, denom df = 145530, p-value < 2.2e-16
Contributions to the Democratic party came from a slight female majority whereas contributions to the Republican party came from an overwhelming male majority. The mean contribution amount as well as the IQ range is larger for males than females.
##
## One-way analysis of means
##
## data: df$amount and df$party
## F = 6963.4, num df = 2, denom df = 150200, p-value < 2.2e-16
This definitely resembles the previous plot. Republicans also have a higher mean contribution amount and IQ range than democrats.
## [1] 0.1585001
This is a weak correlation as seen by the points in the plot and the trend is probably due to the higher per capita income demographic around 60K.
## [1] -0.1101269
It’s interesting to see that there is a second peak on this plot. Almost like a final rally in contribution amount just before the election.
Next I want to look at total contribution amounts by party on the map. My guess is that rural areas will show more Republican contributions. The colors for each map correspond to percentile buckets with cuts at 15%, 30%, 45%, 55%, 70%, %85, and 100%.
Both of the maps seem to have the heaviest concentrations of total amounts coming from city areas but it does look like there are more dark colors away from cities for the republican party.
Relatively speaking, there were no strong relationship discovered in the numerical correlations. There were however significant correlations (abs. value > 3%) among all of the variables and amount. Per capita income had the strongest correlation (16%) followed by time (numerical date), median age, median rent, total population, percent asian, percent white, percent black, and percent hispanic with the lowest (-3.25%).
Time, total population, percent black, and percent hispanic were all negatively correlated. Among the ordered factors of gender and party, republican contributions were on average higher than democrat, as were male contributions higher than female.
I did not expect that time and total population would be correlated with contribution amount. It appears that early on contributions are largest, which might make sense to support a candidate for a longer campaign. Total population seems a bit arbitrary as zipcodes are not necessarily zoned for equal area.
The strongest relationship among all the variables was that between percent white and percent black of a contributors location. These are negatively correlated at 97%. Among the variables of interest the strongest correlation was between per capita income and amount which I suspected to be so.
Now I’ll take a look at relationships between amount and a couple of a my highly correlated features from the bivariate exploration. There are a lot of contributions in the dataset and with scatterplots it’s difficult to see the relationships especially since the correlations are weak, so I’ll be using smoothing lines instead.
The first plot of amount by per capita income faceted by gender doesn’t really give us any new information. Both lines for gender have a similar slope and mirror each other. The vertical translation between the lines only highlights what we saw before in the bivariate section with respect to the differences in mean contribution amount across gender. The second plot, however, seems to show that as the per capita income of a contributor’s zipcode increases, republicans make donations in larger amounts as compared to democrats. The high starting point for the LOESS smoothing line in both plots looks like it’s due to some outliers who donated large amounts but who reportedly reside in areas with low per capita income.
As in the previous plot grid, time doesn’t seem to be a factor across gender for contribution amount. Converseley, there is a difference across party, and it looks as if contribution amounts for republicans have 2 peaks. The first peak is somewhat shared for both parties and corresponds to what we saw before with larger contribution amounts in 2011 vs 2012. The second peak is pretty exclusive to the republican party though, and it looks like republicans donate in larger amounts right before the election as compared to democrats.
I want to look at the average contribution by zipcode to see if there is any difference between urban and suburban/rural areas.
It looks like the areas with the highest average contribution amounts aren’t necessarily closest to cities. If 1 on 1 interaction between candidates and people is what gets donations for a campaign, then time would be best spent in these outer city areas.
I want to build a linear model of my most highly correlated features to try and predict amount. I’ll be predicting the log of amount since we saw that the distribution was log-normal. I’ll also add features one by one to my model in order of highest correlation to see if there are improvements in performance.
##
## Calls:
## m1: lm(formula = log(amount) ~ party, data = model_df)
## m2: lm(formula = log(amount) ~ party + percap_incm, data = model_df)
## m3: lm(formula = log(amount) ~ party + percap_incm + gender, data = model_df)
## m4: lm(formula = log(amount) ~ party + percap_incm + gender + time,
## data = model_df)
##
## ==========================================================================
## m1 m2 m3 m4
## --------------------------------------------------------------------------
## (Intercept) 3.791*** 3.316*** 3.220*** 27.722***
## (0.004) (0.010) (0.011) (0.460)
## party: republican/democrat 1.112*** 1.074*** 1.025*** 1.000***
## (0.007) (0.007) (0.007) (0.007)
## percap_incm 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000)
## gender: male/female 0.206*** 0.202***
## (0.007) (0.007)
## time -0.000***
## (0.000)
## --------------------------------------------------------------------------
## R-squared 0.2 0.2 0.2 0.2
## adj. R-squared 0.2 0.2 0.2 0.2
## sigma 1.2 1.2 1.2 1.2
## F 27956.3 15536.2 10747.7 8928.6
## p 0.0 0.0 0.0 0.0
## Log-likelihood -236032.7 -234738.7 -234258.3 -232853.0
## Deviance 221771.5 217836.1 216393.0 212225.7
## AIC 472071.3 469485.4 468526.6 465718.0
## BIC 472100.9 469524.9 468576.0 465777.2
## N 144541 144541 144541 144541
## ==========================================================================
## Analysis of Variance Table
##
## Model 1: log(amount) ~ party
## Model 2: log(amount) ~ party + percap_incm
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 144539 221771
## 2 144538 217836 1 3935.4 2611.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: log(amount) ~ party + percap_incm
## Model 2: log(amount) ~ party + percap_incm + gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 144538 217836
## 2 144537 216393 1 1443.1 963.88 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: log(amount) ~ party + percap_incm + gender
## Model 2: log(amount) ~ party + percap_incm + gender + time
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 144537 216393
## 2 144536 212226 1 4167.2 2838.1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = log(amount) ~ party + percap_incm + gender + time,
## data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3095 -0.8334 -0.0842 0.7320 4.8197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.772e+01 4.600e-01 60.26 <2e-16 ***
## partyrepublican 1.000e+00 6.749e-03 148.20 <2e-16 ***
## percap_incm 1.553e-05 2.948e-07 52.66 <2e-16 ***
## gendermale 2.022e-01 6.580e-03 30.72 <2e-16 ***
## time -1.822e-08 3.421e-10 -53.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.212 on 144536 degrees of freedom
## Multiple R-squared: 0.1981, Adjusted R-squared: 0.1981
## F-statistic: 8929 on 4 and 144536 DF, p-value: < 2.2e-16
The analysis of variance tables show that each feature added was significant and helped to improve the fit of the model. Even so, the correlation coefficients for per capita income and time were very small compared to party and gender. This was surprising as per capita income had a stronger correlation to amount than gender. Looking at the residuals it seems as though there is a a kind of slope towards negative error. The standardized residuals are very close to normal which indicates that our linear model to predict … Despite this, the linear model only accounted for about 20% of the variation in donation amount which means that our features are still poor predictors for contribution amount.
I looked at amount against all of the significantly correlated features but adding in gender and/or party as a third variable. In nearly all of the comparisons gender and party proved to be a significant in differentiating total contribution amount. Specifically, contribution amount was higher for males than females as was it for Republicans versus Democrats. Also, because of the majority male constituency for Republican doners, we see that male trends generally mirror Republican trends as do female trends mirror Democratic trends.
I knew that gender and party might be significant factors but I did not know to what extent (these features could not be analyzed in the correlation table). I was somewhat surprised to see that differences in gender and party were universal across all other features with respect to contribution amount. An interesting finding was that Democratic contribution amounts show little increase with increasing per capita income of the contributors demographic as compared to Republican contribution amounts. Also, looking at contribution amount over time by party showed that, despite both parties having larger contribution amounts earlier on, Republicans increased their contribution amount leading up to the election whereas Democrats do not. The same is true for males over females but the relationship is less pronounced. This sort of last minute increase in contribution amount reminds me of a type of rally behavior. Whether or not this is effective in catapulting a candidate to nomination is a whole other question altogether but I doubt it to be so (especially since Romney lost Ohio in 2012). Another interesting difference between males and females is that as the median age of the demographic of the contributor increases, male contribution amount tends to increase whereas female contribution shows a slight decrease. The same idea applies for Republicans and Democrats, Republicans showing an increase in amount as the median age of the contributor’s zipcode increases but holding steady for Democrats. If median age of the contributors zipcode did in fact reflect the actual age of the contributor, we could hypothesize that females are less inclined to donate large amounts as they get older.
I experimented with several different models and found that one which modeled the log of amount with per capita income, gender, and party was the most effective in explaining variance in contribution amount. I tried to include time as a factor because it was highly negatively correlated with amount however adding this feature only decreased the R-squared value. Adding all of the remaining significantly correlated features had the same effect to decrease the R-squared value.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 25.0 50.0 210.8 150.0 10000.0
The first plot shows both how contribution amounts are distributed log-normally and how contributions increase leading up to an election.
This plot grid shows that, in Ohio, most Democratic contributions were made by females and an overwhelming majority of Republican contributions were made by males. Also that, in general, Republican and male contribution amounts are higher than Democrat and female contribution amounts. An interesting rally phenomenon can be seen here with Republican party as the election date approaches.
This final plot shows that, despite total contribution amount being highly correlated with city center, average contribution amount is highest surrounding a city and also in some rural areas. This may shed light on why presidential candidates spend a significant amount of time campaigning in suburbs and seemingly rural areas.
In my investigation of 2012 Presidential Campaign Contributions for the state of Ohio, I chose to focus on finding the most significant features of a contributors information that could be used to predict the actual contribution amount. The most significant features proved to be per capita income of the area in which the contributor lives (which we can assume gives an idea of the contributor him or herself), the gender of the contributor, and the political party affiliation of the contributor (simplified to be either Democrat or Republican). Per capita income of the contributor’s zipcode has a positive correlation with the contributor’s contribution amount. Males have on average higher contribution amounts than females as do Republicans versus Democrats. This would lead one to conclude that given these correlated features, the highest contribution amount could belong to a male Republican who lives in an area with high per capita income. The lowest contribution amount might belong to a contributor who is a female Democrat and who lives in an area with a low per capita income.
Despite these findings, the model that was developed was only able to account for about 18% of the variation in contribution amount. It would have been great to have actual income, age, and ethnicity of the contributor him/herself. I believe that these would have had a much higher correlation than the demographic information of the contributors zipcode. The demographic information was at best a rough approximation of the contributor.
With regard to the choropleth maps, it seemed apparent that total contributions and total contribution amount were highly correlated with city proximity. Average contribution appeared to be higher closer to cities, but with a buffer between the actual city center and high average contribution amounts.
There are several shortcomings of the data set. First, I question the validity of some of the information as several contribution amounts had to be changed from negative to positive. Second, as compared to other zipcodes, some lacked a substantial amount of data. This may have skewed the average contribution choropleth map. Another shortcoming was the inability to programatically determine the gender by first name of the contributor for about 4% of the data. I had a difficult time with regex in R, were I more adept at this, that data might have been included.
If possible, further analysis could include distance, or some measure of proximity, to a city center. The choropleth maps that were generated attempted to show a spatial relationship between amount and cities. This was however at best an approximation without any concrete measurements to back-up the claims/insights. To do this, an average latitude and longitude value could be calculated and added as a variable for each zipcode, and another variable could be added for distance to the closest city.